Information processing method and system for synchronization of biomedical data

ABSTRACT

An information processing method and system, for synchronization of disease progression data of individual patients, includes receiving disease progression data in an aperiodic form and representing the disease progression data as a set of functions having finite asymptotic values. The parameters of the set of functions are clustered and the step of representing the disease progression data as a set of functions includes transforming the functions into time invariant form and thereby synchronizing individual patient data that is clustered.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation-in-part of PCT Application Serial No. PCT/US02/17015, filed on May 31, 2002, which claims priority to provisional application serial No. 60/294,638, filed on Jun. 1, 2001, the contents of which are incorporated herein in their entireties.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] This invention relates generally to the field of disease stratification and staging which can be used in predictive medicine to assess disease progression. More specifically, the present invention relates to synchronization of biomedical data, such as disease progression data, so that disease progression for individuals can be analyzed more meaningfully.

[0004] 2. Description of the Related Art

[0005] Modern medicine makes use of disease-specific knowledge to: (a) select the best and most cost-effective therapy for an individual patient; and (b) guide the development of: (i) the next generation of diagnostics, (ii) therapeutic drugs, (iii) health-care products, and (iv) lifestyle recommendations. Knowledge about a particular patient is derived from observations of that patient. These observations may include family history, findings from a physical examination, blood and urine test results, imaging studies such as MRI and CT, and the like; genetic information is also being obtained more frequently. In addition, gene-expression and protein-expression data from microarray technology will soon be available for clinical use.

[0006] Increasingly, traditional disease classifications are being subdivided into categories according to the mechanism or gene responsible, even though all categories share common clinical features. This subdividing process is known as “disease stratification.” Stratification can be used to select the most appropriate diagnostic and therapeutic course for a patient, and to predict outcomes. It can also be used to define appropriate stratum-specific targets for drug development. Generally, stratification has been based on: (a) a single salient biochemical marker; (b) obvious differences in response to current therapy; or (c) differences in particular genes.

[0007] One of the main reasons for obtaining diagnostic information is to determine the stage of progression of a patient's disease. This information is critical to determining the appropriate therapy for the disease. In the case of cancer, the stage of the disease will determine whether surgery, radiation therapy, chemotherapy, or a combination of the above is most appropriate, and will further determine the exact approach to each. In the case of kidney disease, the stage of disease will determine whether the disease is best treated with medicine, diet and lifestyle changes, or whether dialysis and transplantation need to be considered. By way of another example, staging and evaluation of postmenopausal osteoporosis can be used to balance the benefits of hormone replacement therapy against the risks of adverse effects from estrogen use.

[0008] At the current state of clinical practice, both stratification and staging involve ambiguity and overlap. Single-disease markers fail to give a complete picture of disease progression. In assessing diabetes, for example, both glucose and Hemoglobin A1c are measured; one gives a short-term measurement while the other assesses long-term glycemic control.

[0009] Ambiguities may arise in how to stage a particular patient, depending on which markers of disease progression are used. Moreover, the defined stages of the disease may overlap. Accordingly, better methods are needed to determine (a) the disease path on which a patient is located and (b) where the patient is along that path.

[0010] In particular, since “time zero” of a patient's disease progression is rarely known, there is a need for an efficient way to synchronize the relevant biomedical data without requiring excessive computation. Typically, the available data consists of clinical records which describe changes to several quantitative variables over time. An investigator wishes to stratify patients into groups depending on their clinical course. This process is complicated by the fact that the data is generally unsynchronized, i.e., data records begin at varying points in the course of the disease. Therefore, patients whose data do not look alike, in terms of their current clinical picture, in fact may belong to the same disease stratum because they could be at different time points along the continuum of a given pattern of disease progression (or stratum).

SUMMARY OF THE INVENTION

[0011] A solution to one or more of the previously described deficiencies can be achieved by an information processing method and system which can stratify a disease and predict its progression and more accurately synchronize the relevant disease progression data without requiring excessive computation.

[0012] An information processing method, system, and software for synchronization of disease progression data of individual patients includes: receiving disease progression data in an aperiodic form; representing the disease progression data as a set of functions having finite asymptotic values; and clustering parameters of the set of functions. The step of representing the disease progression data as a set of functions includes transforming the functions into time invariant form and thereby synchronizing individual patient data that is clustered.

[0013] The information processing method and system for assessing disease progression by synchronization of disease progression data will be better understood when the detailed description is considered in light of the figures described below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate an embodiment of the invention and, together with the description, serve to explain the principles of the invention.

[0015] FIG. 1, which is a flow diagram of the current treatment protocol for kidney disease, shows how approximately forty distinct diseases lead to end stage renal disease which is then currently treated by dialysis and possibly further by kidney transplant;

[0016] FIG. 2(a) is a plot of tumor size versus time for one genotype of a particular type of cancer; FIG. 2(b) is a plot of tumor size versus time for another genotype of the same cancer shown in FIG. 2(a);

[0017] FIG. 3(a) is a plot of a first patient's tumor growth versus time; FIG. 3(b) is a plot of a second patient's tumor growth versus time; FIG. 3(c) is a plot of a third patient's tumor growth versus time; FIG. 3(d) is a plot of a fourth patient's tumor growth versus time. The patients in FIGS. 3(a)-3(d) have the same general type of cancer, although they may have different forms of it;

[0018] FIG. 4(a) depicts the tumor growth plots for the four patients represented in FIGS. 3(a)-3(d) when plotted over the same time course; FIG. 4(b), which depicts the curves of FIG. 4(a) realigned, shows that two of the patients in FIGS. 3(a)-3(d) likely share one genotype of the disease represented by one stratum of disease progression whereas the other two patients in FIGS. 3(a)-3(d) likely share a different genotype of the disease represented by a different stratum;

[0019] FIG. 5 is a flowchart representing the formulation of a model based on the measured time dependent data which is used to determine a particular disease's strata;

[0020] FIG. 6 shows a plot of a stratum for Hemoglobin A1C, entitled “HBA1C”;

[0021] FIG. 7 shows a plot of a stratum for Retinopathy, entitled “ETDRS”;

[0022] FIG. 8 shows a plot of a stratum for Motor Nerve Velocity;

[0023] FIG. 9 shows a plot of a stratum for Sensory Nerve Velocity;

[0024] FIG. 10 is a diagram illustrating a logistic curve; and

[0025] FIG. 11 is a flow diagram illustrating the steps of a data synchronization and clustering algorithm in one embodiment of the invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0026] Reference will now be made in detail to embodiments of the invention, which are illustrated in the drawings. In one aspect, the present invention comprehends a model of disease progression that is based entirely on the data provided. The approach of the invention does not require input regarding the underlying theory or mechanisms of the disease. In one aspect, as discussed with respect to FIGS. 1-9, the present invention relates to employing disease progression data as the basis for stratification and staging. In another aspect, in the section entitled “Single-Pass Synchronization of Biomedical Data,” the present invention relates to synchronization of disease progression data (as an example of the more generic category of biomedical data), so that the disease progression data from different sources (or patients) can be meaningfully aligned along the time dimension.

[0027] The present invention, in one aspect, employs clinical observations of patients or other organisms as the basis for stratification and staging. The observations are stored and processed in a digital computer system. Some or all of the observations, from some or all of the patients, may be processed at once. Parameters derived from the data are subjected to the statistical procedure known as “cluster analysis,” which groups patients together based on the shape of the curves representing changes in observed variables over time. Each cluster of patients potentially represents a different disease stratum. Adjustments are made to account for the fact that observations of different patients begin at different points in the progression of their respective disease processes. These adjustments can be used to determine the stage of disease progression for each individual patient within their disease stratum. Once the strata and stages are initially defined, the cluster analysis and adjustments can be repeated, so that a convergent, iterative process of stratification and staging takes place. Furthermore, the section entitled “Single-Pass Synchronization of Biomedical Data” provides a technique for synchronization of data along the time dimension without requiring multiple iterations, and this technique may be used to assist the process of stratification and staging as also discussed further herein.

[0028] The present invention stratifies diseases based on observations of patients. The term “stratification” refers to the identification of subsets within what has been traditionally known as a single disease, such as breast cancer. A “patient” typically refers to a human individual affected by a disease, but it encompasses animals and even plants that are subject to disease processes. Uses of stratification include: (a) identifying molecules which are targets for the development of therapeutic drugs, aimed at a particular disease stratum; (b) selecting optimum therapy, which may include drugs and/or lifestyle changes, based on a particular stratum; (c) selecting diagnostic tests based on a particular stratum; or (d) predicting the course of disease based on the stratum into which a patient falls.

[0029] As a hypothetical example, FIGS. 2(a) and 2(b) show plots of tumor growth over time for two different genotypes of cancer. Tumor size is associated with the severity of the disease. Genotype A1 201 and Genotype A2 202 may clinically appear to be the same disease, but they follow different time courses. By analyzing data from a large number of patients over time, the present invention can assist the clinician and researcher in distinguishing between these two distinct forms of cancer, which may in fact respond to different kinds of treatment. For simplicity, a single disease-associated variable, tumor size, is shown. In an actual application, the distinctions between Genotype A1 201 and Genotype A2 202 might not be apparent unless several additional variables, such as cell DNA content and expression of various genes, are examined in a high-dimensional space.

[0030] The present invention also determines the stage of progression of a patient's disease, based on an analysis of observations of the patient. Diseases tend to progress through a series of stages over time, particularly if untreated. Treatment may modify the order of progression, or may alter the amount of time spent in each stage of the disease process. FIG. 1 shows an example of the stages of renal disease leading to kidney failure and transplant. Any one of a large number of medical conditions can bring a patient into a state of end-stage renal disease 105, in which the kidneys are no longer competent to filter waste products from the bloodstream. The patient will then be placed on dialysis 110. A number of dialysis patients will go on to receive kidney transplants 115. Some of these will suffer acute rejection 120 and loss of the kidney due to the immune response. Others will suffer effects from chronic rejection 125, but will eventually be able to maintain some state of health with the transplanted kidney. While FIG. 1 illustrates disease stages as discrete steps, other diseases progress on a continuous basis, and the distinction between stages (e.g., tumors staged as I, II, III, etc.) is not a natural division, but rather a convenience for the clinician and researcher.

[0031] It is important that each patient be observed periodically over time. If observations are not made at several points in time, one cannot tell, for instance, if a patient is being seen early in the course of a severe disease, or later in the course of a milder one. The observations of each patient may consist of any of the items that might enter a patient's medical record. Results of a family history and physical examination may be included, along with laboratory test results from blood, urine, or other specimens. Imaging studies such as MRI may be included. Special tests such as electrocardiograms or pulmonary-function tests may be included. Results of histological/pathological examination of specimens may be included as well. Results of genetic testing may be included, and are expected to fulfill an important role in the future. Data from DNA microarrays may be included to measure gene expression in patient tissues of importance. Data from newer microarray technology may measure protein expression as well. The date of observation may be recorded, along with the observation itself. For some patients, observations may cover the entire time course of the disease, including the time period prior to the appearance of the first symptoms.

[0032] In all cases, these data should be obtained in or converted into a form that will permit two observations to be compared in a numerical fashion, in order to determine a “distance” between them. For verbal descriptions such as in the physical exam, this can be accomplished with a controlled vocabulary and numerical coding. For example, “The patient appears well” could be coded as a “5,” with “The patient appears acutely ill” as a “3,” and “The patient is comatose” as a “1.” For imaging studies, it may be necessary to measure features within the image, such as the diameter of tumors. More subjective features, such as pulmonary infiltrates in a chest X-ray, could, for example, be rated by clinicians on a scale of 0/+ to ++++, coded by the numbers 0 to 4. Presence or absence of genes may be coded as 0 or 1. Multiple possible alleles of a given gene may each be given a particular code. An “observation” refers to a single number, or description that can be converted to a number, associated with a particular patient at a particular time. A “variable” is an aspect of the patient that may be observed, such as blood pressure, tumor diameter, serum creatinine level, or the expression level of a particular gene.
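As a purely illustrative sketch of the numerical coding described in the preceding paragraph, the following Python fragment maps verbal findings and gene presence to numbers so that two observations can be compared. The names `APPEARANCE_CODES`, `encode_appearance`, `encode_gene_presence`, and `observation_distance` are hypothetical and introduced only for illustration; only the codes 5, 3, and 1 and the 0/1 gene coding come from the text.

```python
# Hypothetical controlled vocabulary; the codes 5, 3, 1 follow the example in the text.
APPEARANCE_CODES = {
    "The patient appears well": 5,
    "The patient appears acutely ill": 3,
    "The patient is comatose": 1,
}


def encode_appearance(description: str) -> int:
    """Convert a controlled-vocabulary phrase to its numeric code."""
    if description not in APPEARANCE_CODES:
        raise ValueError(f"phrase not in controlled vocabulary: {description!r}")
    return APPEARANCE_CODES[description]


def encode_gene_presence(present: bool) -> int:
    """Code presence or absence of a gene as 1 or 0."""
    return 1 if present else 0


def observation_distance(code_a: float, code_b: float) -> float:
    """Absolute difference: one possible 'distance' between two coded observations."""
    return abs(code_a - code_b)
```

Rejecting phrases outside the vocabulary, rather than guessing a code, keeps every stored observation comparable in the numerical sense the paragraph requires.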

[0033] In general, a patient may have more than one disease, and multiple diseases may interact. A given disease may be characterized by one or more observations, or by a measure of disease progression derived from those observations. This includes disease-progression measures derived from the present invention. Such measures may fill the role of “observations” in the investigation of a second disease present in the same patient. Thus, the present invention may be generalized so that it can be used to study more than one disease at a time in a particular patient population.

[0034] FIG. 5 shows a flowchart of the analysis process. Observations are stored in a digital computer system. The observations may be entered manually via a keyboard, or may be transferred from another computer such as a Laboratory Information Management System (LIMS), electronic medical record, or genetic analysis system. As shown in steps 505 and 510, the data set of disease observations over time is used to select a subject for analysis of disease based on, for example, demographics and treatment history.

[0035] While “staging” of diseases is generally thought of in discrete terms (e.g., “Stage I,” “Stage II,” “Stage III,” etc.), for purposes of this invention, the stage of disease is generally a continuous numerical value. These continuous staging estimates can be derived by shifting the patient time series in step 515 with respect to one another within each stratum so that they are aligned. FIG. 4(a) shows that, if the patient data series 301-304 shown in FIGS. 3(a)-(d) are aligned in “real time,” they cannot be directly compared against one another, because they are not aligned in terms of the stage of the disease process. FIG. 4(b) shows the patient data series 301-304 realigned by disease stage.

[0036] As shown in steps 520-545, once the time series are aligned, the next goal is to stratify the disease by clustering patients together who have similar time courses. This process begins with the creation of a “distance matrix,” as known to one skilled in statistics, particularly cluster analysis. A triangular matrix of distances among all pairs of patients must be computed. Each inter-patient distance will be a function of individual distances calculated for each variable. The function would take the form of a sum or weighted sum. The distances for a given variable would be, in turn, a sum of distances between individual observations for that variable. This sum also may be weighted.
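The nested sums described in the preceding paragraph can be sketched as follows. This is a minimal illustration, not the specification's implementation: it assumes the two patients' series have already been aligned and sampled at matching time points, it uses absolute differences as the per-observation distance, and the function names and weight dictionaries are hypothetical.

```python
def variable_distance(series_a, series_b, obs_weights=None):
    """Weighted sum of absolute differences between paired observations of one variable."""
    if len(series_a) != len(series_b):
        raise ValueError("series must be aligned to the same number of time points")
    if obs_weights is None:
        obs_weights = [1.0] * len(series_a)
    return sum(w * abs(a - b) for w, a, b in zip(obs_weights, series_a, series_b))


def patient_distance(patient_a, patient_b, var_weights):
    """Weighted sum over variables of the per-variable distances."""
    return sum(
        weight * variable_distance(patient_a[name], patient_b[name])
        for name, weight in var_weights.items()
    )


def distance_matrix(patients, var_weights):
    """Lower-triangular matrix: row i holds the distances from patient i to patients 0..i-1."""
    return [
        [patient_distance(patients[i], patients[j], var_weights) for j in range(i)]
        for i in range(len(patients))
    ]
```

Storing only the lower triangle reflects the symmetry of the distance, so each pair of patients is evaluated once.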

[0037] In conventional clustering, one typically works from a distance matrix, which lists the similarity of every object to be clustered versus every other object. Conventionally, this distance matrix is computed once at the start, and then used during the clustering process. However, time shifts inherent in the data cause the distance matrix to vary dynamically as the clusters are formed. This simply means that part of the distance matrix must be updated whenever a cluster is formed.

[0038] Distances between observations may be measured in several ways. In cluster analysis, absolute differences or squared differences are often used for numerical variables. In some cases, such as numerically-encoded gene alleles, it may be desirable to manually create a lookup table to evaluate the “distance” between any two possible observations.

[0039] For the stratification and staging process to be effective, it may be necessary to restrict the population of patients for which the analysis is carried out. For example, it would not be meaningful to compare certain variables observed in babies with the same variables in adults, even if they share the same disease. Also, it will be necessary to ensure that a single analysis does not include a mix of patients who have been subjected to widely varying therapeutic interventions. Otherwise, the method will likely create false “strata” consisting of treated patients in one stratum, and untreated patients in another. Thus, one embodiment of the invention includes a step of specifying criteria in terms of patient demographics (age, height, weight, sex, etc.) and treatment history. Only those patients who meet the specified criteria will be included in subsequent analysis. The criteria used to select patients will differ from one disease to another.

[0040] For purposes of subsequent cluster analysis, it will generally be desirable to include the rate of change of variables with respect to time. There are many published algorithms for calculating the derivatives of a time series. Some of these incorporate multi-point filtering so as not to unduly amplify noise in the data. These algorithms, such as Savitzky-Golay filters, may be useful in connection with some embodiments of the present invention.
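For illustration only, a smoothed first derivative of the kind mentioned in the preceding paragraph can be computed with the standard 5-point quadratic Savitzky-Golay coefficients (-2, -1, 0, 1, 2)/10. The sketch assumes uniformly spaced samples, so aperiodic clinical data would first have to be resampled onto a regular grid; the function name is hypothetical.

```python
def savgol_first_derivative(values, spacing):
    """Smoothed first derivative using the 5-point quadratic Savitzky-Golay filter.

    Assumes uniformly spaced samples `spacing` apart. Returns one derivative per
    interior point (indices 2 .. len(values) - 3); the two points at each end
    are skipped because the window does not fit there.
    """
    if len(values) < 5:
        raise ValueError("at least five samples are required")
    coeffs = (-2, -1, 0, 1, 2)
    derivatives = []
    for center in range(2, len(values) - 2):
        window = values[center - 2:center + 3]
        weighted = sum(c * v for c, v in zip(coeffs, window))
        derivatives.append(weighted / (10.0 * spacing))
    return derivatives
```

Because the filter fits a quadratic over each window, it reproduces the slope of any linear or quadratic trend exactly while averaging out point-to-point noise.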

[0041] For each patient, a time series, including data points for what may be a relatively large number of variables, is present in the data set. In such circumstances, it is generally found that a number of variables are highly correlated with one another. Thus, there may be “extra” variables that carry little significant information. Neural networks and statistical techniques, such as principal components analysis and factor analysis, may be used to reduce the number of variables carried forward into the calculation. Parenthetically, these techniques can have the added advantage that they give insight into the relationships among the variables being studied, and can reduce the number of variables needed for future studies.

[0042] The iterative process of disease stratification 530-540 and staging begins by clustering the patients. Each patient has a number of time-dependent measurements associated with him or her which define a time progression (also called a time series). Each time progression describes a curve corresponding to the observed variable measurements over time. The initial clustering is based on the shape of these curves. Clustering must be based on curve shape rather than on a direct distance measure between the curves, because observations for each patient begin at a different point in time along the course of that patient's disease process (i.e., the calendar date of the observation gives no indication as to how far a patient's disease has progressed). Except in special cases, such as accidental laboratory infection, one does not generally know when “time zero” is. As the computer analyzes the entire time course of a disease, it distinguishes a patient who is in the early stages of a severe disease from a patient who is in the later stages of a milder one (since the curve shapes will generally be different in the two cases).

[0043] Clustering of curve shapes can be accomplished by any of several time progression alignment algorithms. Any conventional clustering algorithm may be used to do the stratification. There are many such algorithms, such as “Single Linkage,” “Complete Linkage,” “K-means,” “Ward's Method,” or the “Centroid Method.” These algorithms would be well-known to anyone familiar with the data analysis art, and are available in standard statistical packages such as SAS and SPSS. These algorithms group like objects together, and keep unlike objects in separate groups. As an initial step, a Savitzky-Golay filter or similar formula can be used to calculate time derivatives for the values forming the curve, thereby eliminating the effect of any constant offset from one curve to another, while also emphasizing curvature and other shape-defining features. The curves can then be aligned with respect to one another by an algorithm such as dynamic programming or wavelet transforms. Each cluster may represent a stratum of disease. It may be desirable for a human operator to split or merge clusters, after examining the data in detail, to obtain the most clinically-meaningful disease stratification.

[0044] We start with each patient in a separate stratum, then let the clustering algorithm agglomerate these strata. The strata are time-shifted with respect to one another when combined, to account for the fact that a patient is almost never observed at “time zero” of the disease process. Further, each patient (or stratum) has a first observation at a different point in the disease process. The appropriate amount of time shift can be determined either iteratively (a range of possible shift amounts is applied and the one that gives the best fit to a mathematical model is chosen) or analytically (least-squares equations are solved, based on the models themselves, to find the best time-shift).
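The iterative option described in the preceding paragraph amounts to a search over candidate shifts. The sketch below is one hedged illustration for a single variable: the logistic model and its parameters (`ceiling`, `rate`, `midpoint`), the function names, and the sign convention (the shift is added to the recorded time to obtain disease time) are all assumptions made for the example.

```python
import math


def logistic(t, ceiling=10.0, rate=1.0, midpoint=0.0):
    """Hypothetical model function with finite asymptotic values (0 and `ceiling`)."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def sum_squared_error(times, values, model, shift):
    """Squared error between observations and the model after shifting the record in time."""
    return sum((model(t + shift) - y) ** 2 for t, y in zip(times, values))


def best_shift(times, values, model, candidate_shifts):
    """Apply each candidate shift and keep the one that gives the best least-squares fit."""
    return min(
        candidate_shifts,
        key=lambda shift: sum_squared_error(times, values, model, shift),
    )
```

A coarse grid keeps the computation small; the chosen shift can be refined with a finer grid around the winner, or replaced by the analytic least-squares solution where the model permits one.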

[0045] When combining strata, we next find a “consensus” time shift that gives an acceptable fit for all of the disease variables measured. Finally, the combined strata are fit to an overall mathematical model which is subsequently re-tested to ensure an acceptable fit. Without re-testing the model, it is conceivable that the model would represent a long “daisy chain” of patients, strung together in time, in a way that would not represent any plausible disease process.

[0046] Within each stratum, the time series for each patient may be further aligned in time to reduce the mean inter-patient distances. The amount of shift required to bring the time series into alignment can be used directly to update the estimate of the patient's current disease stage. This is equivalent to estimating the calendar date of “time zero” for that patient. The cluster analysis can then be repeated. This iterative process will generally converge. At the end, the clusters will represent disease strata, and the amounts of shifting applied to each patient's data, along with the observations at the final time point, indicate the stage of progression of each patient's disease. FIG. 4(b) shows the result of this analysis process. The data are aligned by disease stage, and can therefore be clustered into strata representing subsets of the disease under study. The distance from the time origin to the open circle is a measure of the disease stage, or progression, for each patient.

[0047] In summary, the synchronization and stratification uses a four-step process of clustering, where, to combine a pair of strata, one: (1) determines a best time-shift for each variable; (2) determines a consensus time-shift for all variables together; (3) fits the combined, shifted data to a model; and (4) accepts the combined stratum as valid if the fit is acceptable upon re-testing the model.
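The steps summarized in the preceding paragraph can be outlined in code. The sketch below is a simplification under stated assumptions: step (3) is reduced to scoring the shifted record against fixed, hypothetical model functions rather than refitting a model, the acceptance test of step (4) is a simple threshold on mean squared error, and all function names and the `max_mean_sq_error` parameter are invented for illustration.

```python
def sse(times, values, model, shift):
    """Squared error of one variable against its model after a time shift."""
    return sum((model(t + shift) - y) ** 2 for t, y in zip(times, values))


def per_variable_shifts(record, models, candidates):
    """Step 1: the best candidate shift for each variable taken separately."""
    shifts = {}
    for name, (times, values) in record.items():
        shifts[name] = min(candidates, key=lambda s: sse(times, values, models[name], s))
    return shifts


def total_error(record, models, shift):
    """Total squared error over all variables for one common shift."""
    return sum(
        sse(times, values, models[name], shift)
        for name, (times, values) in record.items()
    )


def consensus_shift(record, models, candidates):
    """Step 2: one shift for all variables, minimizing the total error."""
    return min(candidates, key=lambda s: total_error(record, models, s))


def evaluate_merge(record, models, candidates, max_mean_sq_error):
    """Steps 3-4 (simplified): score the shifted record and accept or reject the merge."""
    shift = consensus_shift(record, models, candidates)
    n_obs = sum(len(times) for times, _ in record.values())
    mean_sq_error = total_error(record, models, shift) / n_obs
    return shift, mean_sq_error <= max_mean_sq_error
```

Here `record` maps each variable name to a `(times, values)` pair and `models` maps the same names to callables of time. Using a single consensus shift reflects the requirement that all variables of one patient move together in time.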

[0048] Approaches to assist in the synchronization of patient time course events may include those described in Prestrelski et al., Proteins 14: 430-39, 440-50 (1992). Prestrelski sets forth a method which enables the alignment and synchronization of discretely measured features and permits determination of, and compensation for, gaps in the measurement variable, using dynamic programming methods.

[0049] In the examples of the Prestrelski articles, the time domain at varying points, which may or may not be coordinated in sampling or synchronization, was not sampled. Rather, the equivalent domain was defined as the position, within an amino acid sequence, which could be similarly numbered in a manner which may be non-identical. The position was chosen as the domain because of the presence of gaps or insertions within the linear axis or at the beginning of the axis coordinate.

[0050] An example of the application to stratification and clustering in disease analysis can be seen in the application to the examination of a database of heart transplant recipients and donors. In such a study, there is a great deal of information concerning the recipient both pre- and post-transplant, and minimal information concerning the donor pre-transplant and none post-transplant. A desired outcome of such analysis would be to determine the potential for enhancing the criteria used to match donors and recipients to enable greater success in the transplant procedure, i.e., survival of the recipient with a transplanted heart. The standard of care requires tissue typing and matching. Additional algorithms, based on the potential matching of donors with recipients of lesser body mass, have been implemented with the expectation that the heart (which is comprised of muscle) would be more likely to survive any atrophy occurring during the transplant and more successful in a smaller recipient. Analysis of this data would normally focus on predicting survival versus non-survival which could be represented by a 1 and 0, respectively.

[0051] Application of the dynamic programming analysis described in the Prestrelski et al. articles enables the donor weight to recipient weight factor to be further refined to incorporate the fact that recipients are typically physically compromised at time of transplant and their actual weight will be below their ideal weight, which more closely reflects the desired organ functional profile. In addition, the donor may, by virtue of being overweight or in poor physical shape, be significantly higher than their ideal weight; dependence on the simple actual weight ratios may not incorporate the “quality” of the donated material adequately. Further, analysis of the survival/non-survival state indicated that this simple classifier was inadequate to represent: (a) the actual desired outcome (which was length of survival); and (b) the potential ability of standard of care procedures to evaluate this adequately post-transplant. Conversion of the scoring of the patients to reflect length of time with successful transplant survival: (a) enabled the progression of transplant success or failure to be more accurately determined; (b) enabled the identification of several specific clusters of progression (in time) which could be related to causative factors that could be anticipated and corrected prior to the procedure; and (c) evaluated the potential utility of the standard of care post-transplant. Accordingly, laboratory tests were successful in warning of potential risks for organ failure or rejection.

[0052] FIGS. 3(a)-(d) show the time course of tumor growth 301-304 for four patients (continuing the hypothetical cancer example set forth in FIGS. 2(a) and 2(b)). The graphed lines in each figure begin with the first measurement taken on the patient corresponding to each of those figures. In general, patients will seek medical care at different points in the progression of their cancer, when symptoms first appear. Thus, no data are available to cover the pre-symptomatic period, even though the tumor exists and is growing during that time. The open circle represents the date of the latest (most current) measurement for each patient.

[0053] Stratification and staging data can then be used for the development of diagnostics, therapeutics, and lifestyle guidelines, and can be used to predict disease outcome and optimize therapy for a particular patient. Once the full analysis has been performed on an adequate set of patients, it is much simpler to stratify and stage disease for a new additional patient. The new patient's observations can be simply aligned and clustered for a best fit to the existing data set. In addition, new observations based on new technologies or methodologies such as clinical, biological, genetic, etc. can be incorporated into the stratification process at any time. The alignment will indicate the disease stage previously described, and the cluster assignment will indicate the stratum to which the patient belongs. Moreover, the model can be updated to reflect the new patient; in this fashion the accuracy of the model can be continuously improved over time.

[0054] To elucidate the conceptual description of embodiments of the invention, an explanation of the method by which the foregoing is accomplished will now be set forth by describing, in detail, a process for stratification and synchronization of patient data to form a disease model.

[0055] Preliminarily, inputs for the model must be defined. The input to the disease modeling process is a set of observations over time, made on a set of N patients, designated i=1 . . . N. There are M different clinical variables which are observed, and these are designated j=1 . . . M. Each variable is observed for each patient at a time designated by t. The observations for each patient, the number of which may vary among the N patients, are indexed by k=1 . . . n_(i). In general, the values of t may differ from patient to patient, and from variable to variable. Thus, the observations consist of an ordered set of pairs {t_(ijk), y_(ijk)}, i=1 . . . N, j=1 . . . M, k=1 . . . n_(i), where for each time t (and for each patient i), there is a corresponding measurement y for each variable j.
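One possible in-memory layout for the ordered pairs {t_(ijk), y_(ijk)} is sketched below. The type aliases and function names are hypothetical. As a further assumption, the sketch keeps a separate time-ordered list per patient and per variable, so the number of observations may differ by variable as well as by patient.

```python
from bisect import insort
from typing import Dict, List, Tuple

Observation = Tuple[float, float]             # (t_ijk, y_ijk)
PatientRecord = Dict[str, List[Observation]]  # variable j -> time-ordered observations
Dataset = Dict[str, PatientRecord]            # patient i -> record


def add_observation(dataset: Dataset, patient: str, variable: str,
                    time: float, value: float) -> None:
    """Insert one (time, value) pair, keeping the series ordered by time."""
    series = dataset.setdefault(patient, {}).setdefault(variable, [])
    insort(series, (time, value))


def observation_count(dataset: Dataset, patient: str, variable: str) -> int:
    """Number of observations stored for one patient and one variable."""
    return len(dataset.get(patient, {}).get(variable, []))
```

Keeping each series sorted at insertion time means later alignment and distance calculations can assume time order without re-sorting.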

[0056] A first output of the disease modeling process is designed and intended to partition the patient population into strata, or clusters. Each stratum represents a pattern in the way that a prototypical “model patient” can progress through a disease. In other words, members of a given stratum share a similar pattern in the way that their observed disease variables evolve over time.

[0057] Depending on the particular clustering algorithm used, a given patient may appear to fall into more than one stratum. For example, this can happen if the patient is only observed early in the course of their disease, and there is not enough information to fully determine to which stratum the patient belongs. It could also happen if the observations occur late in the disease process, and it cannot be determined by which path the patient got there.

[0058] A second output of the disease modeling process is a set of model functions for each variable and for each stratum. These model functions describe the pattern by which each variable can be expected to evolve over time for a patient who is a member of the given stratum. A third output of the disease-modeling process is a set of time-offset values, one for each instance where a patient is a member of a stratum. The time-offset values are determined such that they shift the data for the given patient in time to give the best fit (in a least-squares sense) of the patient's observed data to the corresponding model functions for the stratum. Note that there is one time-offset value per patient, not one per variable. All of the variables for a given patient are inherently linked in time by their co-occurrence in an actual patient and, therefore, are not shifted in time with respect to one another.

[0059] To achieve the desired outputs, an understanding of the stratification and synchronization process is necessary. The synchronization process causes patient records to be offset from one another in time as they are joined together to form strata. A stratum formed by the joining of patients in this fashion is designated by a triple (A, B, Δ), which means “the record for patient B is appended to the record for patient A with an offset of Δ between the first observation time for A and the first observation time for B.” The sign of Δ is positive if B's first observation occurs later than A's and negative if B's first observation occurs before A's. “Strata” then recursively play the role of “patients” in the joining process. For example, a finalized stratum might be designated this way:

(((A,B,−10.3),(C,D,−6.1),+3.2),E,+1.7)

[0060] If (A,B,−10.3) is assigned “Q,” and (C,D,−6.1) is assigned “W,” the result becomes:

((Q,W,+3.2),E,+1.7).

[0061] Further, if (Q,W,+3.2) is assigned “Z,” the finalized stratum becomes:

(Z,E,+1.7)
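The nested triple notation can be unpacked mechanically. The following is a minimal sketch, not the method of the invention itself: it assumes that each offset Δ is measured relative to the time origin of the left-hand member of the triple (the text does not fix how the "first observation time" of a composite stratum is defined), and the function name `flatten_offsets` is hypothetical.

```python
from typing import Dict, Tuple, Union

# A stratum is either a patient label or a triple (left, right, delta).
Stratum = Union[str, Tuple["Stratum", "Stratum", float]]


def flatten_offsets(stratum: Stratum, origin: float = 0.0) -> Dict[str, float]:
    """Return each patient's time offset relative to the origin of the stratum."""
    if isinstance(stratum, str):
        return {stratum: origin}
    left, right, delta = stratum
    offsets = flatten_offsets(left, origin)
    # Assumption: the right member is shifted by delta relative to the
    # origin of the left member.
    offsets.update(flatten_offsets(right, origin + delta))
    return offsets


# The finalized stratum used as the example in the text.
final = ((("A", "B", -10.3), ("C", "D", -6.1), 3.2), "E", 1.7)
offsets = flatten_offsets(final)
for patient, shift in sorted(offsets.items()):
    print(patient, round(shift, 1))
```

Under the stated assumption, patient D ends up at roughly +3.2 − 6.1 = −2.9 relative to patient A; a different convention for the origin of a composite stratum would change these numbers.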

[0062] To begin the modeling process, each patient is placed into its own stratum. That is, patient A becomes a stratum: (A, null, 0). The patient data may be pre-conditioned before the modeling algorithm is applied. The variables should be transformed if necessary (log, square root, etc.) to stabilize variance, so that equal differences in y have equal clinical significance. Variables that are oscillatory or periodic should be replaced by variables that will fit the smoother models used here (e.g., an envelope or amplitude function, or some indication of the number of oscillatory cycles or their frequency). Noise in the data may be removed by digital filtering prior to the stratification process itself.

[0063] At each step of the process below, data for the variables within each stratum are fit to mathematical model functions. The mathematical formulation of the model functions should be chosen so that the model curves exhibit the same general shape features as the actual data. The formulations should also be chosen to have clinically-appropriate behavior when extrapolated beyond the time interval over which the actual data is fit. Thus, mathematically simple forms, such as quadratic and cubic models, may be undesirable, because they diverge to ±∞ outside of the region where they are initially fit. A linear model has been successfully employed, because the error introduced by extrapolation is acceptable.

[0064] Within the guidelines above, other model formulations can be used besides the ones described here. In the modeling process, for example, four different mathematical formulations for models are used in succession:

Constant: y(t)=α

Linear: y(t)=α+βt

Logistic: $y(t) = a + \left(b - a\right)\frac{e^{\alpha + \beta t}}{1 + e^{\alpha + \beta t}}$ or $\ln\left(-\frac{y(t) - a}{y(t) - b}\right) = \alpha + \beta t$

Quadratic Logistic: $y(t) = a + \left(b - a\right)\frac{e^{\alpha + \beta t + \gamma t^{2}}}{1 + e^{\alpha + \beta t + \gamma t^{2}}}$ or $\ln\left(-\frac{y(t) - a}{y(t) - b}\right) = \alpha + \beta t + \gamma t^{2}$

[0065] For a given stratum, each variable ultimately fits into one of these four types of models. Fitting takes place by the following process: First, the data is “fit to a constant” by least squares. This is equivalent to simply setting α equal to the mean value of the data. The root-mean-square (RMS) deviation of the data from the model is then determined.

[0066] Second, the data is fit to a linear model, and the RMS deviation from the best-fit straight line is determined. If the RMS deviation decreases by more than a specified fraction (a parameter of the modeling process), then the linear model is accepted. Otherwise, the constant model is used.

[0067] Third, the data is fit to a logistic curve by an iterative least-squares fitting procedure. The least-squares fitting, in one embodiment, employs a Java routine developed by Steven Verrill of the U.S. Forestry Service, and is adapted from a corresponding FORTRAN software package described in R. B. Schnabel, J. E. Koontz, and B. E. Weiss, A Modular System of Algorithms for Unconstrained Minimization, Report CU-CS-240-82, Comp. Sci. Dept., University of Colorado at Boulder, 1982. The linear model is used to establish initial values for the least-squares iteration. Again the RMS deviation of the data from the curve is determined, and if the fit improves sufficiently versus the linear model, the logistic model is accepted.

[0068] Fourth, and finally, this procedure of fitting, followed by acceptance of the new model if the fit improves sufficiently, is repeated for the quadratic logistic curve. At the end of this step, for each variable in each stratum, there is a description of the type of model (i.e., constant, linear, logistic, or quadratic-logistic) and the number of parameters for the model. Constant models have one parameter, linear models have two, logistic models have four, and quadratic-logistic models have five.

[0069] The next step examines all pairs of strata. Note that pairs are “ordered pairs,” i.e., (A, B) is not equivalent to (B, A). When combining strata, no patient can appear more than once in the combination. Any pairs in which a given patient appears in both stratum A and stratum B are ignored. For each pair of strata, each variable is considered in turn. The first step, for each variable, is to determine the best values (over a suitable range) for Δ, such that the data for stratum B fits (in a least-squares sense) the model for stratum A when offset in time by Δ. In the present example, this is done by simply iterating the least-squares calculation at a series of equally-spaced candidate values for Δ; an alternative would be to generate a set of normal equations and solve for the best value of Δ directly. Note that several values of Δ may give nearly the same degree of fit. In fact, if the model for stratum A is constant, all values for Δ give an equivalently good fit within some range ε, which is a parameter of the modeling process. Thus, at this step in the process, Δ may be a list of values or a range, rather than a single value.
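The grid search over candidate offsets described above might be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: stratum B's times are assumed to map onto stratum A's time axis as t + Δ, "equivalently good within ε" is interpreted as an RMS deviation within ε of the best one found, and the names `rms_with_offset` and `candidate_offsets` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def rms_with_offset(model: Callable[[float], float], times: Sequence[float],
                    values: Sequence[float], delta: float) -> float:
    """RMS deviation of B's data from A's model after shifting B's times by delta."""
    squared = [(y - model(t + delta)) ** 2 for t, y in zip(times, values)]
    return math.sqrt(sum(squared) / len(squared))


def candidate_offsets(model: Callable[[float], float], times: Sequence[float],
                      values: Sequence[float], lo: float, hi: float,
                      step: float, eps: float) -> List[Tuple[float, float]]:
    """Evaluate equally spaced offsets; keep those within eps of the best RMS."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    scored = [(d, rms_with_offset(model, times, values, d)) for d in grid]
    best = min(r for _, r in scored)
    return [(d, r) for d, r in scored if r <= best + eps]


# Hypothetical linear model for stratum A, and B's data lying on it 3 time units later.
def model_a(t: float) -> float:
    return 2.0 + 0.5 * t

hits = candidate_offsets(model_a, [0.0, 1.0, 2.0], [3.5, 4.0, 4.5],
                         lo=-5.0, hi=5.0, step=0.5, eps=1e-9)

# With a constant model every candidate offset fits equally well.
flat_hits = candidate_offsets(lambda t: 4.0, [0.0, 1.0, 2.0], [4.0, 4.0, 4.0],
                              lo=-5.0, hi=5.0, step=0.5, eps=1e-9)
print(hits, len(flat_hits))
```

The second call illustrates why Δ can be a list or range at this stage: a constant model carries no timing information, so the whole grid survives.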

[0070] The algorithm rejects the pair of strata if the best Δ gives a fit to B's data which does not have a small enough RMS deviation from the curve of A's model. The threshold for RMS deviation is another parameter of the modeling process which one of ordinary skill in the art of statistics can set at an appropriate value depending on the nature of the analysis. If this occurs for any variable, then A and B are not considered candidates for inclusion into the same stratum during the current stage of the process. If, however, the stratum pair (A, B) yields an acceptable Δ (or set of Δ's) for all variables, then the next step is to try to reconcile these values into a single Δ for all variables. There can be only one Δ which relates stratum A and stratum B. It is not physically realistic for there to be a separate Δ for each variable, since these data stem from real observations of a real patient at a particular single point in time.

[0071] In this example, the process is to count the number of variables which are consistent with each of the values of Δ listed for the stratum pair. This results in a reduced list of Δ's which are common to all of the variables. If the reduced list contains more than one possible value for Δ, in this example the Δ with the smallest absolute value is chosen. Other options for resolving such ties, such as picking the Δ which gives the best overall RMS fit, may be considered.
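The reconciliation of per-variable offset lists into a single Δ can be sketched as a simple intersection followed by the smallest-absolute-value tie-break. The function name `reconcile_offsets` and the tolerance `tol` used to compare candidate values are assumptions made for illustration.

```python
from typing import Optional, Sequence


def reconcile_offsets(per_variable: Sequence[Sequence[float]],
                      tol: float = 1e-9) -> Optional[float]:
    """Pick one offset acceptable to every variable, or None if there is none."""
    if not per_variable:
        return None
    first, rest = per_variable[0], per_variable[1:]
    # Keep only the candidates that appear (within tol) in every variable's list.
    common = [
        d for d in first
        if all(any(abs(d - other) <= tol for other in candidates)
               for candidates in rest)
    ]
    if not common:
        return None
    # Tie-break used in the example of the text: smallest absolute value.
    return min(common, key=abs)


# Hypothetical candidate lists for three variables of one stratum pair.
chosen = reconcile_offsets([[-2.0, 1.0, 3.0], [1.0, 3.0, 5.0], [-2.0, 1.0, 3.0]])
rejected = reconcile_offsets([[-2.0, 1.0], [4.0, 5.0]])
print(chosen, rejected)
```

Returning `None` corresponds to the case where the pair (A, B) is not merged because no single Δ satisfies all variables.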

[0072] At this point, strata A and B are merged into a new stratum, designated (A, B, Δ), i.e., the data for A and B are combined, using an offset of Δ for B's data with respect to A's. A new model for the combined stratum is then determined using the four model types as described above. The new stratum is “accepted” if the final RMS model fit for the combined data set is sufficiently good, as determined by comparing it against a value which is a parameter of the fitting process. If the stratum is accepted, the stratum (A, B, Δ) is added to the set of strata for evaluation.

[0073] The steps of evaluating pairs are repeated until all possible pairs have been evaluated. At that time, the list of accepted strata may be edited to remove strata below a certain size, and/or those which have not merged with another stratum during a certain number of passes. Editing may be done by some other method which permits the accumulation of large strata while reducing the time spent repetitively evaluating small strata which are “outliers” and are unlikely to merge. The pair-evaluation process is then repeated for a subsequent pass, until no new strata are formed.

[0074] As an alternative to the merging of pairs described above, a different clustering algorithm may be used, such as the “leader algorithm” described in J. A. Hartigan, CLUSTERING ALGORITHMS (John Wiley & Sons, 1975), at pages 74-83. In addition, in a clinical or pharmaceutical research context, membership and position in the various strata can be correlated with clinical and genomic data.

EXAMPLE #1

[0075] Data for modeling were taken from public files for the Diabetes Control and Complications Trial, which are available via ftp on the Internet at gcrc.umn.edu/pub/dcct/. Records for 730 patients in the Standard treatment group were used, since the patients in the Experimental treatment group were artificially “synchronized” by the intervention of the trial. For each patient, ten annual measurements were extracted for four variables (i.e., i=1 . . . 730, j=1 . . . 4, k=1 . . . 10): (a) Hemoglobin A1C (a measure of blood-glucose control); (b) Retinopathy (ETDRS scale scores from fundus photographs, the fundus being the part of an eyeball); (c) Motor Nerve Velocity; and (d) Sensory Nerve Velocity. The latter two values are measures of peripheral neuropathy, another complication of diabetes. Missing values were filled from the most recent previous available value.

[0076] The algorithm previously described was used to cluster the patients into strata by employing time shifts to align like-shaped curves. The resulting strata for the four observed variables are shown in FIGS. 6-9, in which: (a) FIG. 6 shows a stratum 601 for Hemoglobin A1C, entitled “HBA1C;” (b) FIG. 7 shows a stratum 701 for Retinopathy, entitled “ETDRS;” (c) FIG. 8 shows a stratum 801 for Motor Nerve Velocity; and (d) FIG. 9 shows a stratum 901 for Sensory Nerve Velocity. FIGS. 5-9 indicate how the patient records may be fit together by using an appropriate time shift. Thus, each stratum describes a picture of how a prototypical patient would progress through their disease with regard to the four variables studied. The markers in the figures indicate actual patient data points; the lines in each of FIGS. 6-9 are the best-fit modeling function for the strata.

[0077] Single-Pass Synchronization of Biomedical Data

[0078] Biomedical data, such as disease progression data, may be synchronized without requiring iterations in the synchronization process. Therefore, the synchronization process can be computed more efficiently and by using standard software packages for calculations and clustering.

[0079] In one aspect, the synchronization technique disclosed herein synchronizes disease progression data (as an example of biomedical data that may be synchronized using the techniques disclosed herein). The available data typically consists of clinical records which describe the changes in several quantitative variables over time. Typically, the data covers an insufficient duration to describe the entire course of a patient's disease. Therefore, an investigator who wishes to stratify the patients into groups encounters additional complications based on the fact that the disease progression data of the various patients are not synchronized along the time dimension (i.e., with respect to a hypothetical time zero). Therefore, patients' data that do not look alike may actually belong to the same disease stratum because they may represent data from different time points along the continuum of a given pattern of disease progression.

[0080] The data synchronization technique disclosed here solves at least three problems in the stratification and synchronization of disease progression data as listed below.

[0081] (1) By fitting the data to logistic curves or other similar forms, the possibility of periodic data is eliminated. One skilled in the art would recognize that it is not possible to synchronize periodic data per se since the problem would be inherently ambiguous. A similar situation arises in bioinformatics, in that it is not possible to unambiguously align repetitive DNA sequences. In some cases, periodic data can be transformed into an aperiodic form. For example, if a disease is characterized by periodic episodes, one could utilize the cumulative count of episodes, over time, as an aperiodic variable in the present method.

[0082] (2) When only short segments of data (i.e., over a short duration) are available for some patients, they may appear to be linear or constant data segments. Fitting this type of data to a logistic or similar curve is an ill-conditioned problem. Accordingly, the present data synchronization technique handles such data segments appropriately by using constant or linear models for them.

[0083] (3) The technique provides a translation-invariant description of a patient's data, so that synchronization takes place automatically in addition to facilitating the stratification process. In other iterative techniques, the complexity of the synchronization process grows exponentially as the complexity of the data set grows.
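As a small illustration of item (1) above, periodic episode data can be converted into an aperiodic variable by taking a running total. The episode counts below are made up for the example, and the helper name `cumulative_episode_count` is hypothetical.

```python
from itertools import accumulate
from typing import List


def cumulative_episode_count(episodes_per_visit: List[int]) -> List[int]:
    """Convert per-visit episode counts into a running, non-decreasing total."""
    return list(accumulate(episodes_per_visit))


# Hypothetical record: episodes observed at eight consecutive visits.
episodes = [0, 1, 0, 0, 1, 1, 0, 1]
cumulative = cumulative_episode_count(episodes)
print(cumulative)
```

Because the running total never decreases, it can be fit by the monotonic model forms used in the present method, whereas the raw episode indicator cannot.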

[0084] In one aspect, the data synchronization technique of the present invention involves representing clinical data as a set of functions having finite asymptotic values. For example, in one embodiment, the clinical data may be represented as a set of logistic curves. Logistic curves can be applied to modeling disease progression data, since they transition from an initial constant value, at a specified rate, until they reach a final constant value. They are unidirectional, so they do not fit all types of data, although they fit many clinical variables very well. They have appropriate asymptotic behavior and are not periodic. It should be understood that other mathematical models may be used in the present invention as long as they have finite asymptotic values and are preferably aperiodic. A quadratic-logistic or other type of model can be used to accommodate data that increases to a peak or plateau value and then declines. In such a case, a curve representing a clinical variable may have several linear or constant segments. The technique disclosed herein may be adapted to deal with them as discussed further herein.

[0085] If all the data could be fit to a logistic curve, stratification and synchronization would be straightforward. However, much of the data in a typical data set may consist of short segments which lack statistically significant curvature. These data segments may appear to be either linear, or constant over time. Attempting to fit these data segments to a logistic curve would pose an ill-conditioned mathematical problem. The strategy for dealing with these quasi-linear or quasi-constant data records is provided further herein.

[0086] FIG. 10 illustrates a typical logistic curve 1001 that may be used for the data synchronization technique provided by the present invention. The logistic curve 1001 may be represented by an equation of the form:

$\Lambda\left(a, b, \beta, \gamma\right) = a + \left(b - a\right)\frac{e^{\beta\left(t - \gamma\right)}}{1 + e^{\beta\left(t - \gamma\right)}}$

[0087] It should be noted that Δ, the x-intercept of the linear portion of the logistic curve 1001, is not a parameter of the logistic function, but is an auxiliary variable that will be used by the data synchronization technique provided by one aspect of the present invention. Formulas for calculating Δ from the curve parameters are discussed further herein. In the formula above, a and b represent the y intercepts of the logistic curve 1001, γ represents the location of the inflection point of the logistic curve 1001, and β represents the slope of the logistic curve at the inflection point.

[0088] As shown in the flow chart of FIG. 11, clinical data that is received in step 1105 typically cannot be used directly for stratification. Because each patient record samples only a portion of the underlying logistic function, data from the same underlying logistic function may appear very different, depending on what portion of the curve is being sampled. Instead, in step 1110, the data synchronization technique proposed by one aspect of the present invention clusters the parameters of the logistic curves (the a's, b's, β's, and γ's) in a transformed version. The transformation accomplishes two things: it renders the representation of the data into a translation-invariant form, so that “synchronization” occurs automatically along with stratification; and it handles the case of linear or constant data, where a fit to a logistic curve would be mathematically ill-conditioned.

[0089] Thereafter, in step 1115, the data synchronization method utilizes a clustering algorithm to partition N patients into groups based on data about V variables for each of the N patients. Depending on the clustering algorithm, the number of groups may be pre-specified (e.g., K-means clustering algorithm) or may be determined by the algorithm from specified parameters (e.g., using the complete linkage technique). For each patient i and variable j, there is, therefore, a set of data points

{t _(ijl) ,d _(ijl) },l=1 . . . n _(ij)

[0090] Ignoring, for the moment, the possibility of linear or constant data, each variable will be fit, for each patient, to a logistic model

$m_{ijl} = \Lambda\left(a_{ij}, b_{ij}, \beta_{ij}, \gamma_{ij}\right) = a_{ij} + \left(b_{ij} - a_{ij}\right)\frac{e^{\beta_{ij}\left(t_{ijl} - \gamma_{ij}\right)}}{1 + e^{\beta_{ij}\left(t_{ijl} - \gamma_{ij}\right)}}$

[0091] This fitting (in the clustering step 1115) can be performed by nonlinear least squares, for example, by using a local Taylor series expansion for the logistic function. A Jacobian matrix can be computed, allowing the calculation of an error covariance matrix for the fit of the function to the data. Depending on the clustering algorithm used, this covariance matrix may be supplied as input to the distance function used in the clustering algorithm.

[0092] Standard Euclidean distance functions can be used as the distance function. But if an estimate of the variance for each variable is available, an appropriate distance measure might be based on the z-score (as defined below).

[0093] So for two variables with estimates θ_(a) and θ_(b) and associated variances σ_(a) ² and σ_(b) ², the distance (z score) can be computed from the following formula:

$z^{2} = \frac{\left(\theta_{a} - \theta_{b}\right)^{2}}{\sigma_{a}^{2} + \sigma_{b}^{2}}$
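A direct transcription of the z-score formula is shown below as a minimal sketch. The function name `z_squared` and the guard against a zero total variance are additions made for illustration only.

```python
def z_squared(theta_a: float, var_a: float, theta_b: float, var_b: float) -> float:
    """Squared z-score distance between two estimates with known variances."""
    total_var = var_a + var_b
    if total_var <= 0.0:
        # Assumption: without a positive variance estimate the z-score is undefined.
        raise ValueError("the sum of the variances must be positive")
    return (theta_a - theta_b) ** 2 / total_var


# Two hypothetical parameter estimates: 5.0 (variance 1.0) and 2.0 (variance 2.0).
example = z_squared(5.0, 1.0, 2.0, 2.0)
print(example)
```

Compared with a plain Euclidean difference, the same gap between two estimates counts for less when the estimates themselves are uncertain.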

[0094] In one embodiment, an enormous computational advantage in this step can be realized by pre-clustering the patient data into “base patterns,” which represent highly-similar data records for individual variables. The fitting of each “base pattern” to determine logistic parameters, corresponding to the base pattern, needs to be performed only once. Thereafter, the patients can simply be “pointed to” these sets of parameters for “their” base patterns.

[0095] In step 1110, by changing the parameterization of the data for each patient, the clustering algorithm in step 1115 can solve the synchronization problem automatically, in addition to performing stratification. To accomplish this, each patient's individual curves (corresponding to the variables) are represented by using their a's and b's (the y intercepts), but the β's and γ's (which represent the slope and the location of the inflection point of the logistic curve) are transformed. These parameters (the β's and γ's) tie the curves to the time axis, and the transformation based on these parameters results in a translation-invariant form. In particular, the γ's (the inflection points) are modified, based on the slope of the linear portion of the curve, to give a value for Δ (the x-intercept of the linear portion of the logistic curve) that is used to transform the curves into the translation-invariant form. For example, the curves may be transformed so that they all have a common value for Δ. Alternatively, the value of Δ used to synchronize the curves on the time duration may be fixed to be within a certain range of values.

[0096] Therefore, if the slope m of a logistic curve is given by

$m_{ij} = \frac{\beta_{ij}\left(b_{ij} - a_{ij}\right)}{4}$

then

$\Delta_{ij} = \gamma_{ij} - \frac{a_{ij} + b_{ij}}{2m_{ij}}$

[0097] The translation-invariant formulation is then created by computing

Δ′_(ijj*)=Δ_(ij)−Δ_(ij*)

[0098] for each pair of variables j and j*. The remainder of the methoduses a, b, m, and the transformed Δ's, to represent each curve.
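The translation invariance of the pairwise differences can be checked numerically. The sketch below computes m and Δ from the logistic parameters using the formulas above, then shifts every inflection point γ by the same amount and confirms that the differences between the Δ's are unchanged. The parameter values and function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Params = Tuple[float, float, float, float]  # (a, b, beta, gamma) for one variable


def slope_and_intercept(a: float, b: float, beta: float, gamma: float) -> Tuple[float, float]:
    """Slope m at the inflection point and x-intercept of the linear portion."""
    m = beta * (b - a) / 4.0
    delta = gamma - (a + b) / (2.0 * m)
    return m, delta


def pairwise_delta_differences(params: Sequence[Params]) -> List[float]:
    """Differences between the x-intercepts for every pair of variables j < j*."""
    deltas = [slope_and_intercept(*p)[1] for p in params]
    return [deltas[j] - deltas[k] for j, k in combinations(range(len(deltas)), 2)]


# Hypothetical patient with three variables, observed on two different time origins.
patient = [(1.0, 5.0, 0.8, 4.0), (2.0, 10.0, 0.5, 6.0), (0.5, 3.0, 1.2, 2.0)]
shifted = [(a, b, beta, gamma + 7.5) for a, b, beta, gamma in patient]

original_diffs = pairwise_delta_differences(patient)
shifted_diffs = pairwise_delta_differences(shifted)
print(all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(original_diffs, shifted_diffs)))
```

Each individual Δ moves by 7.5 when the time origin moves, but the common shift cancels in every difference, which is why the clustering step does not need to know the time origin.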

[0099] In the above discussion, Δ is defined as the x-intercept, that is, the point where the curve crosses (or where the extrapolated curve would cross) the line y=0. To improve the accuracy of synchronization, it may be desirable to define Δ as an intercept with a fixed value y=Y other than zero. This may prevent small errors in m from being magnified in the calculation of Δ.
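One way to write the intercept with a general level y=Y is sketched here, assuming that the "linear portion" of the curve is the tangent line at the inflection point (which is consistent with the slope formula for m_(ij) given above). The tangent line is

$y = \frac{a_{ij} + b_{ij}}{2} + m_{ij}\left(t - \gamma_{ij}\right)$

and setting y=Y and solving for t gives

$\Delta_{ij}(Y) = \gamma_{ij} - \frac{a_{ij} + b_{ij} - 2Y}{2m_{ij}}$

For Y=0 this reduces to the x-intercept formula stated earlier. Choosing Y near the midpoint (a_(ij)+b_(ij))/2 makes the second term small, so an error in m_(ij) has correspondingly little effect on Δ_(ij)(Y).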

[0100] If a more complicated model is used, for example, by using a quadratic-logistic curve instead of the logistic curve, one skilled in the art of statistics would recognize that there will be additional parameters although the same transformation principles discussed above will still apply. These additional parameters may represent, for example, the value of quasi-constant curve segments, the slope of quasi-linear segments, and the x-intercepts of those quasi-linear segments.

[0101] If the distance measure to be used for clustering is one that makes use of a covariance matrix, such as the z-score distance described earlier herein, the covariance matrix for the Δ's can be derived by propagation-of-error formulas. For example, a first-order Taylor series expansion of the formulas discussed earlier herein can be generated. This Taylor series will then express the covariance of the Δ's in terms of the variables whose covariance is already known from the curve-fitting procedures.

[0102] As would be recognized based on the above discussion, synchronization would be simplified if one could establish a “master variable” (present in the data for each individual patient) that would never be constant, and could be counted on to provide a landmark to synchronize patients and clusters in time. The synchronization discussion herein assumes that no such variable exists. Lacking such a master variable, a symmetric approach is used, where all variables are treated equally, and a triangular matrix of Δ′ differences is stored. This renders the representation of each patient's data in a translation-invariant form. Therefore, when the patients are clustered, they are automatically “synchronized” as well.

[0103] After the clustering process, the time origin can be restored for the data corresponding to each patient. In one embodiment, this can be accomplished by averaging each of the Δ′'s for a given cluster, then adjusting a given patient's Δ's by an amount δ which gives the best fit to the group average. In other words, let

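$\overline{\Delta}'_{kj} = \operatorname{mean}_{l \in \text{cluster } k}\left(\Delta_{lj}\right)$

[0104] then

$\delta_{i} = \operatorname{mean}_{j}\left(\overline{\Delta}'_{kj} - \Delta_{ij}\right)$ where $i \in$ cluster k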

[0105] From these formulae, δ_(i) is the shift to apply to patient i to align it to the cluster.
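The restoration of the time origin can be sketched as follows, taking the two formulas above literally: the per-variable cluster mean of the Δ's is computed first, and each patient's shift is the mean, over variables, of the gap between the cluster mean and the patient's own Δ. The data layout (dictionaries keyed by hypothetical patient and cluster labels) and the function name `restore_shifts` are assumptions; missing Δ values are not handled in this sketch.

```python
from statistics import mean
from typing import Dict, List


def restore_shifts(deltas: Dict[str, List[float]],
                   clusters: Dict[str, List[str]]) -> Dict[str, float]:
    """Compute the time shift delta_i that aligns each patient to its cluster."""
    shifts: Dict[str, float] = {}
    for members in clusters.values():
        n_vars = len(deltas[members[0]])
        # Per-variable mean of the x-intercepts over the members of the cluster.
        cluster_mean = [mean(deltas[p][j] for p in members) for j in range(n_vars)]
        for p in members:
            shifts[p] = mean(cluster_mean[j] - deltas[p][j] for j in range(n_vars))
    return shifts


# Hypothetical cluster of three patients with two variables each.
deltas = {"P1": [2.0, 4.0], "P2": [5.0, 7.0], "P3": [8.0, 10.0]}
clusters = {"k1": ["P1", "P2", "P3"]}
shifts = restore_shifts(deltas, clusters)
print(shifts)
```

In this made-up example the three patients have identical curve shapes observed 3 time units apart, so applying the shifts brings every patient's Δ's onto the cluster averages exactly.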

[0106] In one embodiment of the synchronization and clustering algorithm, output and input of the algorithms can be represented as follows. A feature vector X_(i), representing a patient i with V variables per patient, consists of

$3V + \frac{V\left(V - 1\right)}{2}$

[0107] elements:

[a_(i1) b_(i1) m_(i1) a_(i2) b_(i2) m_(i2) . . . Δ′_(i12) Δ′_(i13) . . . Δ′_(i23) . . . ]
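The layout of the feature vector can be illustrated with a short sketch that assembles it from per-variable values of a, b, m, and Δ. Using `None` to stand for the "MISSING" marker discussed later in the text, and the function name `feature_vector`, are choices made for this example only.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple

# (a, b, m, delta) for one variable; None stands in for a MISSING entry.
Curve = Tuple[Optional[float], Optional[float], Optional[float], Optional[float]]


def feature_vector(curves: Sequence[Curve]) -> List[Optional[float]]:
    """Build [a1 b1 m1 a2 b2 m2 ... d12 d13 ... d23 ...] for one patient."""
    vec: List[Optional[float]] = []
    for a, b, m, _ in curves:
        vec.extend([a, b, m])
    # One translation-invariant difference for every pair of variables j < j*.
    for j, k in combinations(range(len(curves)), 2):
        dj, dk = curves[j][3], curves[k][3]
        vec.append(None if dj is None or dk is None else dj - dk)
    return vec


# Hypothetical patient with V = 3 variables; the third has no usable a and b.
curves = [(1.0, 5.0, 0.8, 2.0), (2.0, 10.0, 1.0, 5.0), (None, None, 0.5, 1.0)]
x_i = feature_vector(curves)
print(len(x_i), x_i)
```

For V = 3 the vector has 3·3 + 3 = 12 elements, matching the count 3V + V(V−1)/2.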

[0108] Appropriate covariance information may be retained and used in the calculation of inter-patient differences. This information can be fed into a clustering algorithm to produce a number G of patient clusters, representing G different disease progression strata or patterns.

[0109] With K-means-type clustering algorithms, the pre-set number of clusters G must be determined. This is often done by trial and error, or by one of several methods described in the literature on K-means as is known to those skilled in the art of statistics.

[0110] Note that if the pre-clustering step above indicates that it is possible to cluster patients, not just individual variables, then it may be possible to represent each cluster only once in the clustering algorithm, and thereby substantially increase computation speed.

[0111] In one embodiment of the method of synchronization and clustering according to the present invention, non-curvilinear data needs to be handled as discussed next herein. To handle the non-curvilinear data, when the data is modeled, the model is characterized as either CONSTANT, LINEAR, or LOGISTIC. Additional types, such as QUADRATIC-LOGISTIC, may also be accommodated based on the same general principles.

[0112] To determine the appropriate type of model, the data is first fit to a CONSTANT model, r_(ijl)=s_(ij), where s_(ij)=mean_(l)(d_(ijl)). The method then determines the root-mean-square error

$e_{c;ij} = \sqrt{\frac{\sum\limits_{l}\left(d_{ijl} - r_{ijl}\right)^{2}}{n_{ij}}}$

[0114] Thereafter, the data is fit to a LINEAR model, r_(ijl)=m_(ij)t_(ijl)+s_(ij), by the formulas of linear regression that are familiar to those skilled in the art of statistics. Linear regression is available and described, for example, in the lm( ) function of the commercially available R Programming Language which has several commercially available implementations. The method then determines the root-mean-square error

$e_{l;ij} = \sqrt{\frac{\sum\limits_{l}\left(d_{ijl} - r_{ijl}\right)^{2}}{n_{ij}}}$.

If $\frac{e_{l;ij}}{e_{c;ij}} < \eta$,

[0115] then the LINEAR model is accepted over the CONSTANT model, whereη is a parameter of the fitting process.

[0116] Finally, the data are fit to the LOGISTIC model

$r_{ijl} = \Lambda\left(a_{ij}, b_{ij}, \beta_{ij}, \gamma_{ij}\right) = a_{ij} + \left(b_{ij} - a_{ij}\right)\frac{e^{\beta_{ij}\left(t_{ijl} - \gamma_{ij}\right)}}{1 + e^{\beta_{ij}\left(t_{ijl} - \gamma_{ij}\right)}}$,

[0117] as discussed above. If the algorithm fails to converge, the LOGISTIC model is rejected. Otherwise, the method again determines the root-mean-square error

$e_{g;ij} = \sqrt{\frac{\sum\limits_{l}\left(d_{ijl} - r_{ijl}\right)^{2}}{n_{ij}}}$.

If $\frac{e_{g;ij}}{e_{l;ij}} < \eta$,

[0118] then the LOGISTIC model is accepted over the LINEAR model.
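The first two stages of this model-selection cascade (CONSTANT versus LINEAR) can be sketched in a few lines using closed-form least squares. The LOGISTIC stage is deliberately omitted here because it requires an iterative nonlinear fit; the function names, the example data, and the value η = 0.5 are assumptions for illustration, and the sketch assumes at least two distinct observation times.

```python
import math
from typing import Sequence, Tuple


def rms(data: Sequence[float], fitted: Sequence[float]) -> float:
    """Root-mean-square error between observations and fitted values."""
    return math.sqrt(sum((d - r) ** 2 for d, r in zip(data, fitted)) / len(data))


def fit_constant(d: Sequence[float]) -> Tuple[float, float]:
    """CONSTANT model: s is the mean of the data. Returns (s, e_c)."""
    s = sum(d) / len(d)
    return s, rms(d, [s] * len(d))


def fit_linear(t: Sequence[float], d: Sequence[float]) -> Tuple[float, float, float]:
    """LINEAR model r = m*t + s by ordinary least squares. Returns (m, s, e_l)."""
    n = len(t)
    t_mean, d_mean = sum(t) / n, sum(d) / n
    sxx = sum((x - t_mean) ** 2 for x in t)
    sxy = sum((x - t_mean) * (y - d_mean) for x, y in zip(t, d))
    m = sxy / sxx
    s = d_mean - m * t_mean
    return m, s, rms(d, [m * x + s for x in t])


def choose_model(t: Sequence[float], d: Sequence[float], eta: float) -> Tuple[str, tuple]:
    """Accept LINEAR over CONSTANT only if e_l / e_c < eta."""
    s_const, e_c = fit_constant(d)
    if e_c == 0.0:  # perfectly constant data: nothing for a line to improve
        return "CONSTANT", (s_const,)
    m, s, e_l = fit_linear(t, d)
    if e_l / e_c < eta:
        return "LINEAR", (m, s)
    return "CONSTANT", (s_const,)


times = [0.0, 1.0, 2.0, 3.0]
rising = choose_model(times, [1.0, 3.1, 4.9, 7.0], eta=0.5)
flat = choose_model(times, [5.0, 5.1, 4.9, 5.0], eta=0.5)
print(rising[0], flat[0])
```

A smaller η makes the procedure more conservative, so short noisy segments stay CONSTANT unless a trend is clearly present.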

[0119] Exemplary variables used to describe these partial LINEAR and CONSTANT models are discussed next.

[0120] Exemplary Variables to Describe LINEAR Models

[0121] Given a linear equation r_(ij)=m_(ij)t+s_(ij) which fits data over the range [t₁, t₂], the x-intercept is computed:

$\Delta_{ij} = -\frac{s_{ij}}{m_{ij}}$

[0122] The value of m_(ij) is stored, along with all the Δ′'s based on Δ_(ij). Covariance information derived from the linear regression can optionally be stored, if the distance measure used by the clustering algorithm requires this. The values of a_(ij) and b_(ij) are recorded as “MISSING”. The clustering algorithm that is used must be one that is tolerant of such missing data.

[0123] Variables to Describe CONSTANT Models

[0124] For CONSTANT models, there are four possible cases (refer to FIG. 10):

[0125] 1) The constant data represents a segment from the “a” end of a logistic curve.

[0126] 2) The constant data represents a segment from the “b” end of a logistic curve.

[0127] 3) The constant data represents a segment from the middle of a logistic curve, where (a−b) is small.

[0128] 4) The constant data represents a segment from the middle of a logistic curve where a and b are large, but β is small.

[0129] The strategy below handles all cases correctly except situation (4) above, which is assumed to be rare. To handle constant data, the provided method sets

a_(ij) = b_(ij) = s_(ij)

[0130] The m_(ij) are all set to “MISSING”, and all Δ′ values whichdepend on Δ_(ij) are set to “MISSING”.

[0131] The distance-determining rule for the clustering algorithm needs to be slightly modified to account for the fact that s could represent either a or b. Let dist represent the function of a and b for two patients i and i* that is normally used to compute a component of the distance measure between the two patients. To accommodate constant data, a modified distance rule dist′ must be used:

$\mathrm{dist}'\left(a_{ij}, a_{i^{*}j}, b_{ij}, b_{i^{*}j}\right) = \begin{cases} \mathrm{dist}\left(a_{ij}, a_{i^{*}j}\right) + \mathrm{dist}\left(b_{ij}, b_{i^{*}j}\right) & \text{if } a_{ij} \neq b_{ij} \wedge a_{i^{*}j} \neq b_{i^{*}j} \\ \min\left(\mathrm{dist}\left(a_{ij}, a_{i^{*}j}\right), \mathrm{dist}\left(b_{ij}, b_{i^{*}j}\right)\right) & \text{otherwise} \end{cases}$

[0132] Using these heuristics, LINEAR and CONSTANT data can be fed into a clustering algorithm, in such a way that their information and their indeterminacy are both properly handled, so that clusters can be generated which contain all of the appropriate patients.
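The modified distance rule can be sketched as follows. The base distance here is a plain absolute difference, which is only a stand-in for whatever component distance the clustering algorithm normally uses; the names `dist` and `dist_prime` mirror the notation of the text, and "MISSING" values are not handled in this sketch.

```python
from typing import Callable


def dist(x: float, y: float) -> float:
    """Stand-in base distance: absolute difference (an assumption for this sketch)."""
    return abs(x - y)


def dist_prime(a_i: float, a_k: float, b_i: float, b_k: float,
               base: Callable[[float, float], float] = dist) -> float:
    """Distance component for (a, b) that tolerates CONSTANT data, where a == b == s."""
    if a_i != b_i and a_k != b_k:
        # Neither patient is CONSTANT for this variable: use both components.
        return base(a_i, a_k) + base(b_i, b_k)
    # At least one patient is CONSTANT, so s may represent either a or b:
    # give it the benefit of the doubt and take the closer match.
    return min(base(a_i, a_k), base(b_i, b_k))


# Two patients with full logistic fits: (a, b) = (1, 5) and (2, 7).
both_logistic = dist_prime(1.0, 2.0, 5.0, 7.0)
# Patient i is CONSTANT at s = 5 (a = b = 5); patient i* has (a, b) = (1, 5.5).
one_constant = dist_prime(5.0, 1.0, 5.0, 5.5)
print(both_logistic, one_constant)
```

In the second call the constant value 5 is compared with both ends of the other patient's curve and is treated as matching the "b" end, which is the closer of the two.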

[0133] It should be noted that describing the invention with drawingsshould not be construed as imposing on the invention any limitationsthat may be present in the drawings. The present invention contemplatesmethods, systems and program products on any computer readable media foraccomplishing its operations. The embodiments of the present inventionmay be implemented using an existing computer processor, or by a specialpurpose computer processor incorporated for this or another purpose.

[0134] As noted above, embodiments within the scope of the presentinvention include program products on computer-readable media andcarriers for carrying or having computer-executable instructions or datastructures stored thereon. Such computer-readable media can be anyavailable media which can be accessed by a general purpose or specialpurpose computer. By way of example, such computer-readable media cancomprise RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage,magnetic disk storage or other magnetic storage devices, or any othermedium which can be used to carry or store desired program code in theform of computer-executable instructions or data structures and whichcan be accessed by a general purpose or special purpose computer. Wheninformation is transferred or provided over a network or anothercommunications connection (either hardwired, wireless, or a combinationof hardwired or wireless) to a computer, the computer properly views theconnection as a computer-readable medium. Thus, any such a connection isproperly termed a computer-readable medium. Combinations of the aboveshould also be included within the scope of computer-readable media.Computer-executable instructions comprise, for example, instructions anddata which cause a general purpose computer, special purpose computer,or special purpose processing device to perform a certain function orgroup of functions.

[0135] The invention has been described in the general context of method steps which may be implemented in one embodiment by a program product including computer-executable instructions, such as program modules, executed by computers in networked environments. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.

[0136] The present invention is suitable for being operated in a networked environment using logical connections to one or more remote computers having processors. Logical connections may include a local area network (LAN) and a wide area network (WAN) that are presented here by way of example and not limitation. Such networking environments are commonplace in office-wide or enterprise-wide computer networks, intranets and the Internet. Those skilled in the art will appreciate that such network computing environments will typically encompass many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

[0137] The invention is not restricted by the description of any of the embodiments previously set forth. Rather, the foregoing description is for exemplary purposes only and is not intended to be limiting. Accordingly, alternatives which would be obvious to one of ordinary skill in the art upon reading the description, are hereby within the scope of this invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed preferred embodiments of the present invention without departing from the scope or spirit of the invention. Accordingly, it should be understood that the description of the method is for illustrative purposes only and is not limiting upon the scope of the invention, which is indicated by the following claims.

What is claimed is:
 1. An information processing method for synchronization of disease progression data of individual patients, comprising: receiving disease progression data in an aperiodic form; representing the disease progression data as a set of functions having finite asymptotic values; and clustering parameters of the set of functions, wherein the step of representing the disease progression data as a set of functions comprises transforming the functions into time invariant form and thereby synchronizing individual patient data that is clustered.
 2. The information processing method according to claim 1, wherein the set of functions having finite asymptotic values comprises a set of logistic curves, and wherein the step of representing the disease progression data as a set of functions comprises fitting each variable for each patient to one of the set of logistic curves.
 3. The information processing method according to claim 2, wherein the clustering step comprises using a nonlinear least squares technique.
 4. The information processing method according to claim 3, wherein the nonlinear least squares technique comprises: using a local Taylor's series expansion for a logistic function representative of a logistic curve; computing a Jacobian matrix to calculate an error covariance matrix for the fit of the function to the data; and using the error covariance matrix as input to provide weighting for a distance function for use in a clustering algorithm.
 5. The information processing method according to claim 1, wherein the clustering step includes using a statistical error in determining the function parameters as a weighting for use in a distance function.
 6. The information processing method according to claim 1, wherein the step of transforming the functions into time invariant form comprises determining an unambiguous time point based on the shape of a function curve.
 7. The information processing method according to claim 6, wherein the unambiguous time point comprises one of an x-intercept or inflection point of the function curve.
 8. The information processing method according to claim 7, wherein the set of functions comprise logistic functions, and wherein the step of transforming parameters of the set of logistic functions into time invariant form for disease progression data of a patient comprises: determining an x-intercept of a linear portion of the logistic function curve for each variable; and determining pair wise differences of x-intercepts for all the variables with respect to corresponding x-intercepts of base patterns, and using the determined pair wise differences information in the clustering step to cluster and synchronize the disease progression data of the patient to one of the base patterns.
 9. The information processing method according to claim 1, wherein the step of receiving the disease progression data comprises filtering out periodic or cyclical data or transforming the periodic or cyclical data into aperiodic form.
 10. A computer program product that causes, when executed, a computing system to synchronize disease progression data of individual patients by performing the steps comprising: receiving disease progression data in an aperiodic form; representing the disease progression data as a set of functions having finite asymptotic values; and clustering parameters of the set of functions, wherein the step of representing the disease progression data as a set of functions comprises transforming the functions into time invariant form and thereby synchronizing individual patient data that is clustered.
 11. The computer program product according to claim 10, wherein the set of functions having finite asymptotic values comprises a set of logistic curves, and wherein the step of representing the disease progression data as a set of functions comprises fitting each variable for each patient to one of the set of logistic curves.
 12. The computer program product according to claim 11, wherein the clustering step comprises using a nonlinear least squares technique.
 13. The computer program product according to claim 10, wherein the step of transforming the functions into time invariant form comprises determining an unambiguous time point based on the shape of the function curve.
 14. The computer program product according to claim 13, wherein the unambiguous time point comprises one of an x-intercept or inflection point of the function curve.
 15. The computer program product according to claim 13, wherein the set of functions comprise a set of logistic functions and wherein the step of transforming parameters of the set of logistic functions into time invariant form for disease progression data of a patient comprises: determining an x-intercept of a linear portion of the logistic function curve for each variable; and determining pair wise differences of x-intercepts for all the variables with respect to corresponding x-intercepts of base patterns, and using the determined pair wise differences information in the clustering step to cluster and synchronize the disease progression data of the patient to one of the base patterns.
 16. The computer program product according to claim 10, wherein the step of receiving disease progression data comprises filtering out periodic or cyclical data or transforming the periodic or cyclical form into aperiodic form.
 17. A system for synchronization of disease progression data of individual patients, comprising: an input unit for receiving disease progression data in an aperiodic form; and a computing unit configured to represent the disease progression data as a set of functions having finite asymptotic values, cluster parameters of the set of functions, wherein the step of representing the disease progression data as a set of functions comprises transforming the functions into time invariant form and thereby synchronizing individual patient data that is clustered.
 18. The system according to claim 17, wherein transforming the functions into time invariant form comprises determining any unambiguous time point based on the shape of a function curve.
 19. The system according to claim 18, wherein the unambiguous time point comprises one of an x-intercept or inflection point of the function curve.
 20. A system for synchronization of disease progression data of individual patients, comprising: means for receiving disease progression data in an aperiodic form; and means for representing the disease progression data as a set of functions having finite asymptotic values, and means for clustering parameters of the set of functions, wherein the step of representing the disease progression data as a set of functions comprises transforming the functions into time invariant form and thereby synchronizing individual patient data that is clustered.
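
By way of illustration only, and without limiting the claims, the x-intercept and pair wise difference steps recited above might be sketched as follows. The sketch assumes a logistic curve of the form y(t) = A / (1 + exp(-k (t - t0))) with a zero baseline and takes the "linear portion" to be the tangent at the inflection point, which crosses y = 0 at t0 - 2/k. It also reads the pair wise differences as differences between variables within one patient; that is one possible reading. The parameterization, function names, variable names, and numeric values are all hypothetical.

```python
from itertools import combinations

def x_intercept(k, t0):
    """x-intercept of the tangent at the inflection point of the assumed
    logistic curve A / (1 + exp(-k * (t - t0))).

    The tangent passes through (t0, A/2) with slope A*k/4, so it reaches
    y = 0 at t0 - 2/k regardless of the asymptote A.
    """
    return t0 - 2.0 / k

def pairwise_differences(intercepts):
    """Differences of x-intercepts for every pair of variables.

    intercepts maps a variable name to its x-intercept. A shift of the
    time origin moves every intercept equally and cancels in each pair.
    """
    return {(a, b): intercepts[a] - intercepts[b]
            for a, b in combinations(sorted(intercepts), 2)}

# Hypothetical fitted (k, t0) values for two variables of one patient.
fits = {"marker_a": (0.5, 10.0), "marker_b": (1.0, 14.0)}

# The same patient with every observation time shifted by 100 units,
# e.g. because the first visit occurred later in the disease course.
shifted_fits = {name: (k, t0 + 100.0) for name, (k, t0) in fits.items()}

diffs = pairwise_differences(
    {name: x_intercept(k, t0) for name, (k, t0) in fits.items()})
shifted_diffs = pairwise_differences(
    {name: x_intercept(k, t0) for name, (k, t0) in shifted_fits.items()})

print(diffs)
print(diffs == shifted_diffs)
```

Because the differences are unchanged by the time shift, they could be compared directly with the corresponding differences of a base pattern without first aligning the patient's clock to the pattern's clock.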